Rate limit restructure (server-side) #933
Conversation
Ha. Interesting - I would have thought the ++ would have caused an error. Good to know!
It did! However, the test was wrong as well: because the sliding window was set to 0, no timestamps were actually stored for the peers. So that part of the code never ran until I started up a node and began experimenting with the config values.
humaite left a comment
looks okay to me at first glance. i will probably re-read it later.
PARAMS="-pa ${TEST_PATH} ${TEST_PATH_BASE} -config ${TEST_CONFIG} -noshell"
TEST_PATH="$(./rebar3 as ${TEST_PROFILE} path)"
## TODO: Generate path for all apps -> Should we fetch this from somewhere?
In theory, those paths are generated using rebar3, but it seems that in the case of the tests this was not done correctly.
config_peer_to_ip_addr({A, B, C, D, _}) -> {A, B, C, D}.

path_to_limiter_ref([<<"chunk">> | _]) -> chunk;
If I understand correctly, this function reads the path from the client request and returns the kind of limiter to use. A data structure like a proplist or a map would probably be easier to modify in the future. Not sure whether it would be easy to do right now, though.
The problem I see here is that if we add a new path, we will need to add it to this function (at least if it needs a special rate limit).
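For illustration, a map-based variant of the lookup could look roughly like the sketch below. The path prefixes and group atoms are taken from this PR, but the helper names and the fallback to the general group are assumptions made for the example.

```erlang
%% Hypothetical sketch of a map-based path -> limiter group lookup.
%% Prefixes and group atoms mirror the PR; names and the fallback are invented.
path_limiter_map() ->
	#{
		<<"chunk">> => chunk,
		<<"data_sync_record">> => data_sync_record,
		<<"metrics">> => metrics
	}.

path_to_limiter_ref([Prefix | _]) ->
	%% Unknown prefixes fall back to the general limiter group.
	maps:get(Prefix, path_limiter_map(), general);
path_to_limiter_ref([]) ->
	general.
```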
Yes, it's certainly going to take some work to add a new rate limiter group:

1. Add a Rate Limiter Group (RLG) process to arweave_limiter_sup.
2. Add the Rate Limiter Group config parameters as fields to the arweave_config config record.
   2b. Add defaults for the parameters.
3. Add Rate Limiter routing logic to ar_http_iface_rate_limiter_middleware.

IMHO, this mapping function is a quite idiomatic way to map paths to atoms (no dynamically named atoms, no need for lookups).
- I'd argue that 2) is a bigger burden than 3) (extending this function).
- I'd like to point out: this PR doesn't aim to introduce dynamic RLGs.
- With a different approach to how we deal with the configuration, I think we can introduce dynamically started RLGs in the future, if necessary.
We are not adding a lot of new end-points, but perhaps documenting this procedure somewhere would be nice.
execute(Req, Env) ->
	IPAddr = requesting_ip_addr(Req),
	{ok, Config} = arweave_config:get_env(),
	case lists:member(blacklist, Config#config.disable) of
To be honest I didn't realize we had a disable blacklist option, and I'm not sure it is currently in use. It looks like it would completely disable rate limiting on the server side. @ldmberman @humaite are you aware of any nodes that use it?
It's possible some large miners use it on internal nodes used to serve data to mining nodes (for example). But I'm not sure it would even be that effective, since by default clients will self-throttle regardless of the server-side behavior.
This is a long way of asking:
- @khetzl did you mean to remove this option? (e.g. because you've replaced it with another configuration option)
- Even if the removal was unintentional... I'm not sure we actually need or want this particular flag. In general the enable and disable options are poorly understood and rarely used, so if we want to disable server-side rate limiting a more formal config is a better choice. The only question is whether anyone is using the current flag.
I intended to remove this; we shouldn't allow the public unrestricted access to resources.
Trusted peers can be listed so their requests are limited via a less restrictive Rate Limiter Group. (Currently, no limiting at all)
Note: Even if the disable blacklist flag is enabled in the current release, a global concurrency limit is enforced by cowboy/ranch.
Got it. I agree it's correct to remove. However, it's worth providing some context:
- In cases like this where it is the right call to remove an option, we may still need to flag and communicate the removal for any miners that are relying on it, both in the release notes and in Discord. I suspect this option isn't used much (or at all), so it's probably fine to silently remove.
- Many miners implement their own security in front of their node. Specifically, they will run nodes on a LAN without a direct internet connection - so even if they opt in to using the disable blacklist flag, it doesn't necessarily mean they or we are providing unrestricted access to the public.
- I don't think this flag is used, but if it is, it is likely due to the above. We've had a few miners run large networks with many internal (non publicly accessible) nodes. In the past they tried using local_peers, but since arweave doesn't currently support any sort of wildcards/subnetting, and because node shutdown/boot can take a long time, some of the miners had trouble managing rate limiting for their internal nodes. It's in that scenario where I could see disable blacklist being useful (set up an internal data-sharing node, disable the blacklist, and then let all your other nodes connect to it without worrying about needing to restart periodically to update local_peers).

All that said: I do think it's fine to remove. But I wanted to call out that context, as our operating model is often different from that of non-blockchain companies (e.g. we build infrastructure software to be run by 3rd parties as well as by us internally - and those 3rd parties vary from novice to enterprise-level).
Yes, the client-side still relies on this. You're right, it will be removed later. Probably the most important change we want to implement on the client side is peers not relying on their local configuration for throttling, but altering their behaviour (throttling) dynamically according to the remote peer's response headers.
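As a rough illustration of that client-side direction (not part of this PR), a peer might back off based on the server's 429 response rather than its own config. The header name and function below are assumptions made for the sketch.

```erlang
%% Hypothetical client-side sketch: back off when the remote peer answers 429,
%% honouring a Retry-After header if one is present, instead of relying on
%% local throttling config. Names are illustrative only.
maybe_backoff(429, Headers) ->
	DelayMs =
		case lists:keyfind(<<"retry-after">>, 1, Headers) of
			{_, Seconds} -> binary_to_integer(Seconds) * 1000;
			false -> 1000
		end,
	timer:sleep(DelayMs),
	retry;
maybe_backoff(_StatusCode, _Headers) ->
	ok.
```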
'http_api.limiter.get_previous_vdf_session.is_manual_reduction_disabled' =
	?DEFAULT_HTTP_API_LIMITER_IS_MANUAL_REDUCTION_DISABLED,

'http_api.limiter.metrics.sliding_window_limit' =
@khetzl @humaite What's the recommended approach for the new config system regarding how deep to nest configs?
E.g. Is there a clear preference http_api_limiter.metrics.sliding_window_timestamp_cleanup_interval vs. http_api_limiter.metrics.sliding_window.timestamp_cleanup_interval?
Especially in a JSON format I could see grouping all the sliding_window or leaky_tick options together making it a little easier to keep everything straight.
I have no preference, I'm happy to introduce further grouping, if you guys think it makes more sense.
I tried to keep it as close as possible to what we use internally in the RLG implementation for readability, but '.' vs '_' wouldn't make much of a difference in this sense.
👍 I don't feel strongly either. We can defer to @humaite and if he doesn't have a clear preference, leave as is.
%%% @doc Arweave Rate Limiter.
%%%
%%% `arweave_limiter' module is an interface to the Arweave
%%% Rate Limiter functionality.
Nice. I like this approach - really helps with modularization 👍
		Accept
	end.

reduce_for_peer(LimiterRef, Peer) ->
Where is this called? I only see it called from ar_http_iface_middleware:handle_post_tx_accepted. I was expecting to see it called whenever an inbound HTTP request is processed, but perhaps I misunderstood
Ah yeah, looks like I misunderstood. Reading the comment, reduce_for_peer is a special case so that we don't penalize peers for sharing valid transactions with us. The normal leaky bucket and window counter tracking is handled elsewhere.
Maybe add a comment to this function to describe it? Main relevant point that I see is that this function is not the standard way to reduce the request count for a peer, but is an edge case handler.
Yes, that's correct. The previous solution immediately reduces the count when a transaction is accepted, to deal with high load.
I'd like to point out that this is a double reduction; we are just following the previous behaviour, and it should rather be handled by setting the config parameters correctly.
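Something along these lines above the function head might capture that; the wording is only a sketch, and the function body is unchanged and not shown here.

```erlang
%% @doc Give one request "slot" back to `Peer' on the given limiter group.
%% This is NOT the standard accounting path: the regular sliding-window and
%% leaky-bucket bookkeeping happens in the rate limiter middleware when a
%% request arrives. This call exists so peers are not penalised for sharing
%% valid transactions with us (currently called from
%% ar_http_iface_middleware:handle_post_tx_accepted).
reduce_for_peer(LimiterRef, Peer) ->
	%% ... existing implementation unchanged ...
```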
	Timestamps.

add_and_order_timestamps(Ts, Timestamps) ->
	lists:reverse(do_add_and_order_timestamps(Ts, lists:reverse(Timestamps))).
Suggested change:
	%% Timestamps is ordered oldest to newest. However we expect Ts to usually be the most recent timestamp. To optimize for this case we reverse, add (likely to the front), then re-reverse.
	lists:reverse(do_add_and_order_timestamps(Ts, lists:reverse(Timestamps))).
Or some other comment just to explain the double-reverse
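For context, the helper itself isn't shown in this diff; a plausible shape (assumed, not the actual implementation) makes the reason for the double-reverse clearer: on the reversed, newest-first list the common case is a single cons onto the head.

```erlang
%% Assumed sketch of the insertion helper (the real do_add_and_order_timestamps
%% is not visible in the diff). Working newest-first means a Ts newer than
%% everything stored is prepended in one step; older timestamps walk the list.
do_add_and_order_timestamps(Ts, []) ->
	[Ts];
do_add_and_order_timestamps(Ts, [Newest | _] = Reversed) when Ts >= Newest ->
	[Ts | Reversed];
do_add_and_order_timestamps(Ts, [Newest | Rest]) ->
	[Newest | do_add_and_order_timestamps(Ts, Rest)].
```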
ok = prometheus_histogram:new([
	{name, ar_limiter_response_time_milliseconds},
	{help, "Time it took for the limiter to respond to requests"},
	{buckets, [infinity]}, %% For meaningful results on this, we need buckets
Without buckets we can still get an average over time because the histogram includes xxx_count and xxx_sum fields. Most of our internal charts just use this average-over-time value to track performance changes. But if some degree of bucketing would be helpful for this metric (e.g. if the distribution is skewed per timeslice), definitely add them!
We have a broader issue of how to deal with our growing number of metrics, but I think we'll likely need to tackle that holistically - until then feel free to add whatever metrics you think are necessary
PR looks great! Most of my comments are minor. One annoying comment: unfortunately we're still using tab indenting rather than spaces. We'd like to reformat everything, but until then it would be good to have new code indented with tabs. One question: have you been able to run any performance tests using the new code? My guess is the impact of the extra logic is minimal. The only potential bottleneck I can see is the
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
	{noreply, State}.

terminate(_Reason, _State) ->
	ok.
Timers not canceled on termination causing resource leak
Medium Severity
The terminate/2 callback ignores the state and doesn't cancel the timers created by timer:send_interval in init/1. Timer refs are stored in state as leaky_tick_timer_ref and timestamp_cleanup_timer_ref, but when the process terminates (crash, restart, shutdown), these timers remain active in the global timer server. They continue firing indefinitely, sending messages to a dead pid. Each process restart leaks two timer entries that accumulate over time, consuming memory and CPU in the timer server.
Yeah, he's commenting this over and over again, but cancelling would crash the tests, so we are not doing it. These processes should live for the lifetime of the node anyway.
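For reference, the cancellation Bugbot keeps suggesting would look roughly like this (intentionally not adopted in this PR, for the reason above); the state keys are the ones from the diff.

```erlang
%% Sketch of what Bugbot is asking for (not done here): cancel the
%% send_interval timers when the gen_server terminates.
terminate(_Reason, #{leaky_tick_timer_ref := LeakyRef,
		timestamp_cleanup_timer_ref := TsRef}) ->
	{ok, cancel} = timer:cancel(LeakyRef),
	{ok, cancel} = timer:cancel(TsRef),
	ok.
```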
Maybe if you resolve all instances of the comment it will stop?
Oh looks like a known issue: https://forum.cursor.com/t/bugbot-should-filter-out-previously-resolved-dismissed-issues/147748
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
	RequestFun,
	ErrorResponse,
	Config#config.requests_per_minute_limit div 2 + 1,
	LimitWithBursts,
Test expects wrong error count for rate limiting
Low Severity
In test_node_blacklisting_get_spammer, the test sends exactly sliding_window_limit + leaky_limit (300) requests but expects 1 error. Based on the rate limiter logic, all 300 requests should pass: the first 150 via sliding window and the next 150 via leaky bucket tokens. The first rejection occurs at request 301. The test only passes because the tolerance check Got >= ExpectedErrors - Tolerance evaluates to Got >= -4, which is always true.
This is incorrect; the test asserts a range correctly.
Force-pushed from 3a3502d to ec826aa.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
apps/arweave/src/ar_config.erl (outdated)
%% RATE LIMITER GENERAL
parse_options([{<<"http_api.limiter.general.sliding_window_limit">>, Limit}|Rest], Config) ->
	case Limit of
		Limit when is_integer(Limit), Limit > 0 ->
Config validation rejects valid zero sliding window limit
Medium Severity
The config validation for sliding_window_limit options requires Limit > 0, but multiple default values in arweave_config.hrl are set to 0 (e.g., DEFAULT_HTTP_API_LIMITER_DATA_SYNC_RECORD_SLIDING_WINDOW_LIMIT, DEFAULT_HTTP_API_LIMITER_BLOCK_INDEX_SLIDING_WINDOW_LIMIT, etc.). A value of 0 is valid and means "skip sliding window, use only leaky bucket." Users cannot explicitly configure this behavior via the config file since validation would reject it with a bad_value error.
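For reference, the loosened guard Bugbot is pointing at would just be >= 0 instead of > 0. Everything else below (the success and error branches, the record field update) is assumed, since it isn't visible in the quoted diff.

```erlang
%% Sketch only -- the success/error branches are assumed. The substantive
%% change is `>= 0`, so an explicit 0 ("skip sliding window, leaky bucket
%% only") passes validation.
parse_options([{<<"http_api.limiter.general.sliding_window_limit">>, Limit} | Rest], Config) ->
	case Limit of
		Limit when is_integer(Limit), Limit >= 0 ->
			parse_options(Rest,
				Config#config{'http_api.limiter.general.sliding_window_limit' = Limit});
		_ ->
			{error, {bad_value, 'http_api.limiter.general.sliding_window_limit'}}
	end;
```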
correct tick reduction config parameter
further alignment of config params
Force-pushed from e9b24f7 to 12d1605.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
| {help, "Time it took for the limiter to respond to requests"}, | ||
| %% buckets might be reduced for production | ||
| {buckets, [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50]}, | ||
| {labels, [limiter_id]}]), |
Histogram buckets misconfigured for microsecond measurements
Medium Severity
The histogram ar_limiter_response_time_microseconds has buckets [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50] but timer:tc/2 returns values in microseconds. A typical gen_server call takes 50-500+ microseconds, meaning virtually all observations will overflow into the +Inf bucket, rendering the histogram useless for monitoring and debugging. The buckets need to be scaled appropriately for microsecond values (e.g., [10, 50, 100, 500, 1000, 5000, 10000, 50000]).
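One possible fix (a sketch, not necessarily what the PR will do): keep the microsecond unit and give the buckets a matching scale, roughly as Bugbot suggests. Alternatively, the observed value could be divided by 1000 and the metric kept in milliseconds.

```erlang
%% Sketch: buckets scaled for integer microsecond observations from timer:tc/2.
ok = prometheus_histogram:new([
	{name, ar_limiter_response_time_microseconds},
	{help, "Time it took for the limiter to respond to requests"},
	{buckets, [10, 50, 100, 500, 1000, 5000, 10000, 50000]},
	{labels, [limiter_id]}]),
```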
			{stop, reject(Req, Reason, Data)};
		_ ->
			{ok, Req, Env}
	end.
Missing config option to disable rate limiting globally
Medium Severity
The old ar_blacklist_middleware allowed users to disable rate limiting by adding "blacklist" to the disable config list. The new ar_http_iface_rate_limiter_middleware doesn't check Config#config.disable for this option. Users who configured "disable": ["blacklist"] to bypass rate limiting will unexpectedly start receiving 429 errors after upgrading, as this configuration is silently ignored by the new middleware.
Discussed above
Cursor Bugbot has reviewed your changes and found 3 potential issues.
| {help, "Time it took for the limiter to respond to requests"}, | ||
| %% buckets might be reduced for production | ||
| {buckets, [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50]}, | ||
| {labels, [limiter_id]}]), |
Histogram buckets mismatch with microsecond measurement unit
Medium Severity
The ar_limiter_response_time_microseconds histogram has buckets [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50], but timer:tc/2 returns time as integer microseconds. A typical gen_server call taking 100 microseconds would return Time = 100, which exceeds the highest bucket (50), causing all observations to fall only in the +Inf bucket. The buckets appear designed for seconds rather than microseconds, making the histogram distribution data useless for observability.
terminate(_Reason, #{leaky_tick_timer_ref := _LeakyRef,
		timestamp_cleanup_timer_ref := _TsRef} = _State) ->
	ok.
Interval timers not cancelled on gen_server termination
Medium Severity
The terminate/2 callback matches on leaky_tick_timer_ref and timestamp_cleanup_timer_ref but doesn't call timer:cancel/1 on them. Timers created with timer:send_interval/3 persist in the timer server until explicitly cancelled. When the gen_server terminates (due to crash, restart, or stop), these timers continue firing and accumulate as resource leaks. Other modules in the codebase properly cancel timers during cleanup.
reset_node() ->
	ar_blacklist_middleware:reset(),
	arweave_limiter_sup:reset_all(),
Peer limiter not reset causing inconsistent test state
Low Severity
The reset_node/0 function resets the local arweave_limiter_sup but only resets the peer's ar_blacklist_middleware, not the peer's limiter. This creates an asymmetric reset where the local node has a clean limiter state but peer1 retains its previous limiter state, potentially causing test flakiness or incorrect rate limiting behavior when tests interact with the peer.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Multiple rate limiter actors to be called from middleware/cowboy_handler processes.
Implementation based on Leaky Bucket Token (https://gist.github.com/humaite/21a84c3b3afac07fcebe476580f3a40b) with additional limits for concurrency (similarly to Ranch concurrency limiter).
It could potentially allow complex rule sets to map requests to limiter pools and/or custom settings via (reworked) ar_config.
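To make the Leaky Bucket Token idea concrete, here is a minimal sketch of the mechanism (based on the linked gist's description, not the actual arweave_limiter code); names and shapes are illustrative only.

```erlang
%% Minimal sketch of a leaky bucket token counter: each request increments a
%% per-peer counter up to a limit, and a periodic "leaky tick" drains every
%% counter by one. Not the actual arweave_limiter implementation.
try_request(Peer, Buckets, Limit) ->
	case maps:get(Peer, Buckets, 0) of
		Used when Used >= Limit ->
			{reject, Buckets};
		Used ->
			{accept, Buckets#{Peer => Used + 1}}
	end.

leak(Buckets) ->
	maps:map(fun(_Peer, Used) -> max(Used - 1, 0) end, Buckets).
```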
The current implementation is not "wired" in. So the new middleware module is not added to ar_http_iface_server, and the supervisor is not yet part of the ar_node supervision tree. The functionality could be extended with values helpful for regulating clients via the Status 429 response headers.

Note
Replaces legacy request throttling with a new, modular server-side rate limiter and integrates it into the HTTP stack.
- arweave_limiter app (supervisor, arweave_limiter_group, time helper, metrics + collector) with sliding-window + leaky-bucket + concurrency controls per limiter group
- ar_http_iface_rate_limiter_middleware (mapped by path; local peers bypass); replaces ar_blacklist_middleware in ar_http_iface_server
- ar_blacklist_middleware reduced to ban-only; removes in-request rate limiting; handle_post_tx_accepted now calls arweave_limiter:reduce_for_peer(general, IP)
- ar_config and arweave_config.hrl extended with comprehensive limiter settings (per group: general, chunk, data_sync_record, recent_hash_list_diff, block_index, wallet_list, get_vdf, get_vdf_session, get_previous_vdf_session, metrics)
- Touches ar.erl, ar_test_node.erl, ar_test_runner.erl; updates rebar.config (app inclusion, meck bump) and the test runner script to include multi-app test paths
- ar_http_iface_tests updated to use new limits and resets; CI workflows include new test modules
- .gitignore tweak for backup files

Written by Cursor Bugbot for commit c0d2190.